SDPA Decode Optimization: Tree Reduce by alingTT · Pull Request #37004 · tenstorrent/tt-metal

alingTT · 2026-02-02T23:31:58Z

Ticket

NA

Problem description

SDPA optimization needed for DS and Llama

What's changed

Reduction previously took n-1 sequential steps (O(n) time), where n is the number of cores in a reducer group.
We can optimize this with tree reduction, in which pairs of cores reduce in parallel, so the complexity drops to O(log(n)) rounds. On Llama 70B Galaxy shapes we see an improvement from 8.3 us to 7.4 us.
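
A minimal host-side sketch of the idea, not the kernel code from this PR. It assumes each core holds a partial softmax state (m, l, o) with o kept unnormalized, uses a scalar o for brevity, and pairs neighbours by index rather than by the PR's vid scheme. The names Partial, combine, flat_reduce and tree_reduce are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Partial attention state held by one core (scalar output for brevity).
// Assumption: o is kept unnormalized and divided by l only at the very end.
struct Partial {
    float m;  // running max of the scores
    float l;  // running sum of exp(score - m)
    float o;  // running sum of exp(score - m) * value
};

// Merge two partial states by rescaling both to the shared max.
inline Partial combine(const Partial& a, const Partial& b) {
    const float m = std::fmax(a.m, b.m);
    const float sa = std::exp(a.m - m);
    const float sb = std::exp(b.m - m);
    return Partial{m, a.l * sa + b.l * sb, a.o * sa + b.o * sb};
}

// Flat reduction: one reducer merges the other n-1 partials one after another.
inline Partial flat_reduce(const std::vector<Partial>& parts, uint32_t* steps) {
    assert(!parts.empty());
    Partial acc = parts[0];
    *steps = 0;
    for (std::size_t i = 1; i < parts.size(); ++i) {
        acc = combine(acc, parts[i]);
        ++*steps;
    }
    return acc;
}

// Tree reduction: in each round, every index that is a multiple of 2 * stride
// absorbs its partner stride slots away. Merges within a round are independent,
// so separate cores could perform them at the same time.
inline Partial tree_reduce(std::vector<Partial> parts, uint32_t* rounds) {
    assert(!parts.empty());
    *rounds = 0;
    for (std::size_t stride = 1; stride < parts.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < parts.size(); i += 2 * stride) {
            parts[i] = combine(parts[i], parts[i + stride]);
        }
        ++*rounds;
    }
    return parts[0];
}
```

With 8 partials the flat version needs 7 dependent merges, while the tree version needs 3 rounds; both produce the same state up to floating-point rounding.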

Checklist

APC
BH APC
BH demo
SDPA L2 Nightly

Copilot

Pull request overview

This pull request implements a tree reduction optimization for SDPA (Scaled Dot-Product Attention) decode operations, improving the reduction complexity from O(n-1) to O(log n) where n is the number of cores in a reduction group. The change replaces the flat worker-to-reducer pattern with a binary tree reduction where cores hierarchically combine their attention results.

Changes:

Introduced tree reduction helper functions (count_trailing_zeros, ceil_log2, get_tree_reduction_params) to compute binary tree structure
Modified program factory to compute and pass tree reduction parameters to each core
Updated writer and compute kernels to perform round-by-round tree reduction with proper synchronization
Added semaphore encoding scheme using 4-bit nibbles per round for fine-grained synchronization

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

File	Description
sdpa_decode_program_factory.cpp	Adds tree reduction parameter calculation and passes them to kernels; builds physical core coordinate arrays for tree communication
writer_decode_all.cpp	Implements tree reduction receiving and sending logic with round-based semaphore synchronization
sdpa_flash_decode.cpp	Modifies compute flow to combine child results in tree pattern and handle root vs non-root finalization
reader_decode_all.cpp	Minor cleanup (blank line removal, assert include)

Copilot · 2026-02-06T02:49:12Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp

+    }
+    // Count trailing ones in vid
+    uint32_t trailing_ones = count_trailing_zeros(~vid);
+    // Root in vid-space is 0  → physical core 0


The comment "Root in vid-space is 0 → physical core 0" is incorrect. Based on the vid mapping (vid = num_cores - 1 - core_id) and root_vid = num_cores - 1, the root has vid = num_cores - 1, not vid = 0. For example, with 8 cores: core 0 has vid=7 (root), core 7 has vid=0 (leaf). The comment should say "Root in vid-space is (num_cores-1) → physical core 0".

Suggested change

-    // Root in vid-space is 0 → physical core 0
+    // Root in vid-space is (num_cores_in_group - 1) → physical core 0
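
To make the mapping in this comment concrete, here is a small sketch of the vid computation and the trailing-ones count. Only the formula vid = num_cores - 1 - core_id and the expression count_trailing_zeros(~vid) come from the comment and hunk above; the function bodies are portable stand-ins, since the PR's own helper implementations are not shown on this page.

```cpp
#include <cassert>
#include <cstdint>

// Mapping quoted in the review comment: vid = num_cores - 1 - core_id.
inline uint32_t core_id_to_vid(uint32_t core_id, uint32_t num_cores) {
    assert(core_id < num_cores);
    return num_cores - 1 - core_id;
}

// Portable stand-in for the PR's count_trailing_zeros helper
// (assumption: the real helper may use a compiler builtin instead).
inline uint32_t count_trailing_zeros(uint32_t x) {
    if (x == 0) {
        return 32;  // no set bit at all
    }
    uint32_t n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

// Trailing ones of vid, computed as in the hunk: count_trailing_zeros(~vid).
inline uint32_t count_trailing_ones(uint32_t vid) {
    return count_trailing_zeros(~vid);
}
```

For 8 cores, physical core 0 maps to vid 7 (binary 111, three trailing ones) and physical core 7 maps to vid 0 (no trailing ones), which matches the example in the comment.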

Copilot · 2026-02-06T02:49:13Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/dataflow/writer_decode_all.cpp

+            ASSERT(num_heads_per_core == 1);  // if there are workers, then head must be split across workers
+


The ASSERT checking num_heads_per_core == 1 when actual_num_children > 0 may be too restrictive. This assertion fails if a core processes multiple heads and any of those heads require tree reduction. However, the tree reduction logic operates per-head (inside the cur_head loop starting at line 236), so it should support multiple heads per core. Consider removing this assertion or moving it outside the head loop if the intent is to ensure heads aren't split across cores within a single reduction group.

Suggested change

-            ASSERT(num_heads_per_core == 1);  // if there are workers, then head must be split across workers

Copilot · 2026-02-06T02:49:13Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp

            cb_wait_front(cb_index_id, 1);
            cur_pos = read_tile_value(cb_index_id, 0, cur_batch / q_heads_parallel_factor);
-            cb_pop_front(cb_index_id, 1);
+            // cb_pop_front(cb_index_id, 1);


The cb_pop_front for cb_index_id is commented out (line 149), which may cause the circular buffer to fill up if multiple operations try to use it. The comment references issue #27979, suggesting this is a known workaround for a mailbox-based synchronization issue. However, if this buffer is meant to be reused across multiple calls or heads, not popping it will eventually exhaust the buffer. Consider documenting why this pop is commented out and whether it needs to be addressed when the referenced issue is resolved.

Copilot · 2026-02-06T02:49:13Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/dataflow/writer_decode_all.cpp

+            // Senders can return, dont need to participate
+            return;


Early return at line 336 skips processing of remaining heads when num_heads_per_core > 1. The return statement is inside the head loop (lines 236-238), so sending to parent causes the kernel to exit completely rather than continuing to the next head. This breaks multi-head processing. The return should be replaced with a break or continue, or the send logic should be moved outside the head loop if each core only processes one head during tree reduction (which the ASSERT at line 245 suggests).

Suggested change

-            // Senders can return, dont need to participate
-            return;
+            // Senders can stop participating in tree reduction for this head
+            break;
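
The control-flow concern can be illustrated with a toy head loop. This is a hypothetical stand-in for the writer kernel, not its real structure: it only counts how many heads are visited when a sending core leaves via return versus when it moves on to the next head.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the writer's head loop. sends_to_parent models a
// non-root core that hands its result to its parent during tree reduction.
inline uint32_t heads_visited_with_return(uint32_t num_heads_per_core, bool sends_to_parent) {
    assert(num_heads_per_core > 0);
    uint32_t visited = 0;
    for (uint32_t cur_head = 0; cur_head < num_heads_per_core; ++cur_head) {
        ++visited;
        if (sends_to_parent) {
            return visited;  // leaves the whole function: later heads are never processed
        }
    }
    return visited;
}

// Same loop, but the sender only stops working on the current head.
inline uint32_t heads_visited_with_continue(uint32_t num_heads_per_core, bool sends_to_parent) {
    assert(num_heads_per_core > 0);
    uint32_t visited = 0;
    for (uint32_t cur_head = 0; cur_head < num_heads_per_core; ++cur_head) {
        ++visited;
        if (sends_to_parent) {
            continue;  // done with this head, go on to the next one
        }
    }
    return visited;
}
```

When num_heads_per_core is 1, which is what the ASSERT in this file enforces, both variants visit the same single head, so the difference only matters if that restriction is lifted.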

Copilot · 2026-02-06T02:49:13Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/dataflow/writer_decode_all.cpp

+                    while (true) {
+                        invalidate_l1_cache();
+                        uint32_t sem_val = *in0_receiver_semaphore_addr_ptr;
+                        uint8_t step_sem = (sem_val >> step_semaphore_shift[round]) & 0x0F;
+                        if (step_sem >= 1) {
+                            break;
+                        }
+                    }


The semaphore is polled but never reset within or after the tree reduction rounds. If num_heads_per_core > 1, the outer loop iterates multiple times (lines 236-238), and the semaphore state from the first head will cause the second head to incorrectly think data has arrived. The semaphore needs to be reset after processing all children, or the encoding scheme needs to account for multiple heads per core.
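
A host-side sketch of the stale-counter concern, using a plain integer in place of the L1 semaphore. The helper names are hypothetical, and clearing the nibble is only one possible remedy, not the PR's fix. On hardware a reset would also have to be ordered against remote increments, which this sketch does not model.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kNibbleMask = 0x0F;
constexpr uint32_t kMaxRounds = 6;  // matches the six-entry shift table in the kernel

// Read round r's 4-bit counter, mirroring
// (sem_val >> step_semaphore_shift[round]) & 0x0F from the hunk above.
inline uint32_t round_count(uint32_t sem_val, uint32_t round) {
    assert(round < kMaxRounds);
    return (sem_val >> (round * 4)) & kNibbleMask;
}

// The condition the polling loop exits on.
inline bool child_arrived(uint32_t sem_val, uint32_t round) {
    return round_count(sem_val, round) >= 1;
}

// One possible remedy (an assumption): clear the round's nibble once the
// child's data for the current head has been consumed.
inline uint32_t clear_round(uint32_t sem_val, uint32_t round) {
    assert(round < kMaxRounds);
    return sem_val & ~(kNibbleMask << (round * 4));
}
```

If a child bumps round 1 once for head 0 and nothing clears it, child_arrived still returns true when head 1 starts polling, before any new data has been sent.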

Copilot · 2026-02-06T02:49:14Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp

@@ -499,43 +575,57 @@ void kernel_main() {
                    move_block<true>(cb_cur_sum, cb_prev_sum, Sq_chunk_t);


Missing CB pop operations after consuming child data. The code reads from cb_l_in (line 537) and cb_out_o (line 559) but only pops cb_m_in (line 573). The cb_l_in and cb_out_o circular buffers need cb_pop_front calls to free the consumed tiles, otherwise the CBs will fill up and block on subsequent rounds or heads. Add: cb_pop_front(cb_l_in, Sq_chunk_t) and cb_pop_front(cb_out_o, out_chunk_tiles) after line 575.

Suggested change

-                    move_block<true>(cb_cur_sum, cb_prev_sum, Sq_chunk_t);
+                    move_block<true>(cb_cur_sum, cb_prev_sum, Sq_chunk_t);
+                    // Release consumed child L and O tiles from their circular buffers.
+                    cb_pop_front(cb_l_in, Sq_chunk_t);
+                    cb_pop_front(cb_out_o, out_chunk_tiles);
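
The failure mode described here can be sketched with a toy tile counter. This is not the tt-metal circular buffer API, just a hypothetical model of producer and consumer accounting. Whether the real kernel needs the extra pops depends on what its helper functions pop internally, which this page does not show.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a circular buffer's tile accounting (hypothetical).
struct ToyCircularBuffer {
    uint32_t capacity_tiles;
    uint32_t occupied_tiles;

    explicit ToyCircularBuffer(uint32_t capacity) : capacity_tiles(capacity), occupied_tiles(0) {}

    // Producer side: succeeds only if there is room (a real kernel would block instead).
    bool try_push_back(uint32_t tiles) {
        if (occupied_tiles + tiles > capacity_tiles) {
            return false;
        }
        occupied_tiles += tiles;
        return true;
    }

    // Consumer side: release tiles that have already been read.
    void pop_front(uint32_t tiles) {
        assert(tiles <= occupied_tiles);
        occupied_tiles -= tiles;
    }
};

// Count how many reduction rounds can deliver child data before the buffer is full.
inline uint32_t rounds_completed(
    uint32_t capacity_tiles, uint32_t tiles_per_round, uint32_t num_rounds, bool pop_after_use) {
    ToyCircularBuffer cb(capacity_tiles);
    uint32_t completed = 0;
    for (uint32_t round = 0; round < num_rounds; ++round) {
        if (!cb.try_push_back(tiles_per_round)) {
            break;  // a blocking producer would stall here
        }
        if (pop_after_use) {
            cb.pop_front(tiles_per_round);
        }
        ++completed;
    }
    return completed;
}
```

With a capacity of 2 tiles and 1 tile per round, three rounds complete only when the consumer pops after each use; without the pop, the third delivery finds no room.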

Copilot · 2026-02-06T02:49:14Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp

+
+            // Free up prev buffers if we used them
+            if (actual_num_children > 0) {
+                cb_pop_front(cb_prev_max, Sq_chunk_t);


Missing cb_pop_front for sum buffer in root finalization. At line 660, only cb_prev_max is popped, but cb_prev_sum is also consumed (used in lines 607, 614, 622, 628) and needs to be freed. When actual_num_children == 0, cb_cur_sum is used instead but also not popped. Add cb_pop_front(sum_cb, Sq_chunk_t) after line 660 to properly release the consumed sum buffer.

Suggested change

-                cb_pop_front(cb_prev_max, Sq_chunk_t);
+                cb_pop_front(cb_prev_max, Sq_chunk_t);
+                // Also free the sum buffer (prev or current), which was consumed via sum_cb
+                cb_pop_front(sum_cb, Sq_chunk_t);

Copilot · 2026-02-06T02:49:14Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp

+            // + reducer coords + output coords
+            std::vector<uint32_t> writer_rt_args(16 + MAX_TREE_REDUCTION_ROUNDS + 2 * num_cores_per_head, 0);


The writer runtime args size for idle cores doesn't account for reducer and output core coordinates. The active cores receive additional args via insert operations at lines 1180-1183, appending reduce_core_physical_xs, reduce_core_physical_ys, output_core_physical_xs, and output_core_physical_ys. However, the idle core size calculation at line 1221 only accounts for base args, children_per_round, and group coords. The size should be: 16 + MAX_TREE_REDUCTION_ROUNDS + 2*num_cores_per_head + 2*num_reducer_cores + 2*num_output_cores.

Suggested change

-            // + reducer coords + output coords
-            std::vector<uint32_t> writer_rt_args(16 + MAX_TREE_REDUCTION_ROUNDS + 2 * num_cores_per_head, 0);
+            // + reducer coords (2*num_reducer_cores) + output coords (2*num_output_cores)
+            std::vector<uint32_t> writer_rt_args(
+                16 + MAX_TREE_REDUCTION_ROUNDS + 2 * num_cores_per_head + 2 * num_reducer_cores + 2 * num_output_cores,
+                0);
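
A small sketch of the size arithmetic proposed in this comment. The base count of 16 comes from the hunk; the value 6 for the round limit is an assumption inferred from the six-entry semaphore tables in the writer kernel, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBaseWriterArgs = 16;
constexpr std::size_t kMaxTreeReductionRounds = 6;  // assumption, see lead-in

// Total writer runtime arg count: base args, one child slot per round,
// then an (x, y) pair for every group, reducer and output core.
inline std::size_t writer_rt_args_size(
    uint32_t num_cores_per_head, uint32_t num_reducer_cores, uint32_t num_output_cores) {
    assert(num_cores_per_head > 0);
    return kBaseWriterArgs + kMaxTreeReductionRounds +
           (2 * static_cast<std::size_t>(num_cores_per_head)) +
           (2 * static_cast<std::size_t>(num_reducer_cores)) +
           (2 * static_cast<std::size_t>(num_output_cores));
}

// Zero-filled args for an idle core, sized to match what active cores receive.
inline std::vector<uint32_t> make_idle_writer_rt_args(
    uint32_t num_cores_per_head, uint32_t num_reducer_cores, uint32_t num_output_cores) {
    return std::vector<uint32_t>(
        writer_rt_args_size(num_cores_per_head, num_reducer_cores, num_output_cores), 0);
}
```

For example, 8 cores per head, 4 reducer cores and 2 output cores give 16 + 6 + 16 + 8 + 4 = 50 entries.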

Copilot · 2026-02-06T02:49:14Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/dataflow/writer_decode_all.cpp

+    // Semaphore encoding: each round uses a 4-bit field (nibble) in the semaphore value
+    // Round 0: bits 0-3, Round 1: bits 4-7, Round 2: bits 8-11, etc.
+    // step_semaphore_inc[r] = 1 << (r * 4) is the value to add to increment round r's counter
+    constexpr uint32_t step_semaphore_inc[6] = {1, 16, 256, 4096, 65536, 1048576};
+    // step_semaphore_shift[r] = r * 4 is the bit position to read round r's counter
+    constexpr uint32_t step_semaphore_shift[6] = {0, 4, 8, 12, 16, 20};


The semaphore encoding uses 4-bit nibbles per round (lines 65-70), which limits each round's counter to 0-15. This means a parent can receive from at most 15 children per round. With the binary tree structure, each parent receives from at most 1 child per round, so this is sufficient. However, if the tree structure changes in the future to allow multiple children per round, this encoding would fail. Consider adding a compile-time assertion or comment explaining this constraint.
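
One way the suggested compile-time check could look, as a sketch. The two tables are copied from the hunk above; the constants kMaxRounds, kBitsPerRound and kMaxChildrenPerRound and the helper signal_round are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kMaxRounds = 6;            // matches the six-entry tables
constexpr uint32_t kBitsPerRound = 4;         // one nibble per round
constexpr uint32_t kMaxChildrenPerRound = 1;  // binary tree: one child per round, per the comment

constexpr uint32_t step_semaphore_inc[kMaxRounds] = {1, 16, 256, 4096, 65536, 1048576};
constexpr uint32_t step_semaphore_shift[kMaxRounds] = {0, 4, 8, 12, 16, 20};

// Check that every table entry follows shift = 4 * r and inc = 1 << shift.
constexpr bool tables_consistent() {
    for (uint32_t r = 0; r < kMaxRounds; ++r) {
        if (step_semaphore_shift[r] != r * kBitsPerRound) {
            return false;
        }
        if (step_semaphore_inc[r] != (1u << step_semaphore_shift[r])) {
            return false;
        }
    }
    return true;
}

static_assert(tables_consistent(), "semaphore tables must follow the nibble-per-round layout");
static_assert(kMaxRounds * kBitsPerRound <= 32, "all round counters must fit in one 32-bit semaphore");
static_assert(kMaxChildrenPerRound <= 0x0F, "a round's child count must fit in its 4-bit counter");

// Value of the semaphore after one child signals the given round.
inline uint32_t signal_round(uint32_t sem_val, uint32_t round) {
    assert(round < kMaxRounds);
    return sem_val + step_semaphore_inc[round];
}
```

If the tree ever allowed more than 15 children in a round, the last static_assert would fail at compile time instead of letting a counter overflow into the next round's nibble.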

Copilot · 2026-02-06T02:49:14Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp


+    // Calculate tree reduction parameters
+    // num_tree_reduction_rounds = ceil(log2(num_cores_per_head))
+    uint32_t num_tree_reduction_rounds = ceil_log2(num_cores_per_head);


No runtime validation that num_tree_reduction_rounds doesn't exceed MAX_TREE_REDUCTION_ROUNDS. If num_cores_per_head is greater than 2^MAX_TREE_REDUCTION_ROUNDS (64), the system would silently produce incorrect results or access out-of-bounds array indices. Add a runtime check: TT_FATAL(num_tree_reduction_rounds <= MAX_TREE_REDUCTION_ROUNDS, "Tree reduction rounds {} exceeds maximum {}", num_tree_reduction_rounds, MAX_TREE_REDUCTION_ROUNDS).

Suggested change

-    uint32_t num_tree_reduction_rounds = ceil_log2(num_cores_per_head);
+    uint32_t num_tree_reduction_rounds = ceil_log2(num_cores_per_head);
+    TT_FATAL(
+        num_tree_reduction_rounds <= MAX_TREE_REDUCTION_ROUNDS,
+        "Tree reduction rounds {} exceeds maximum {}",
+        num_tree_reduction_rounds,
+        MAX_TREE_REDUCTION_ROUNDS);
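
A standalone sketch of the round computation and the bound check. The body of ceil_log2 is an assumption (the PR's helper is named on this page but not shown), the limit of 6 is inferred from the "(64)" in the comment, and a C++ exception stands in for TT_FATAL so the example compiles outside tt-metal.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

constexpr uint32_t kMaxTreeReductionRounds = 6;  // assumption: 2^6 = 64 cores at most

// Smallest r such that 2^r >= n.
inline uint32_t ceil_log2(uint32_t n) {
    assert(n > 0);
    uint32_t r = 0;
    while ((uint64_t{1} << r) < n) {
        ++r;
    }
    return r;
}

// Round count for a reduction group, rejecting groups that need too many rounds.
inline uint32_t checked_num_rounds(uint32_t num_cores_per_head) {
    const uint32_t rounds = ceil_log2(num_cores_per_head);
    if (rounds > kMaxTreeReductionRounds) {
        throw std::invalid_argument("tree reduction rounds exceed the supported maximum");
    }
    return rounds;
}
```

Under these assumptions 64 cores per head still fit (6 rounds), while 65 cores would need 7 rounds and be rejected up front rather than indexing past the per-round arrays.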

github-actions

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

github-actions · 2026-02-06T02:52:25Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp

+};
+
+inline TreeReductionParams get_tree_reduction_params(uint32_t core_id_in_group, uint32_t num_cores_in_group) {
+    TreeReductionParams params;


⚠️ cppcoreguidelines-pro-type-member-init ⚠️
uninitialized record type: params

Suggested change

-    TreeReductionParams params;
+    TreeReductionParams params{};

github-actions · 2026-02-06T02:52:25Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp

+    for (uint32_t r = 0; r < MAX_TREE_REDUCTION_ROUNDS; r++) {
+        params.children_per_round[r] = UINT32_MAX;


⚠️ modernize-loop-convert ⚠️
use range-based for loop instead

Suggested change

-    for (uint32_t r = 0; r < MAX_TREE_REDUCTION_ROUNDS; r++) {
-        params.children_per_round[r] = UINT32_MAX;
+    for (unsigned int & r : params.children_per_round) {
+        r = UINT32_MAX;
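
The two clang-tidy suggestions above (value-initializing params and using a range-based loop) can be seen together in a short sketch. The struct layout here is hypothetical, since the real TreeReductionParams is not shown in full on this page, and the round limit of 6 is an assumption.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kMaxTreeReductionRounds = 6;  // assumption

// Hypothetical field layout, for illustration only.
struct TreeReductionParams {
    uint32_t num_rounds;
    uint32_t children_per_round[kMaxTreeReductionRounds];
};

// Build params with every round marked as "no child".
inline TreeReductionParams make_default_params() {
    TreeReductionParams params{};  // value-initialization: every member starts at zero
    // Zero would read as "child 0", so each slot is set to the UINT32_MAX sentinel.
    for (uint32_t& child : params.children_per_round) {
        child = UINT32_MAX;
    }
    return params;
}

// True if the given round has a real child rather than the sentinel.
inline bool has_child(const TreeReductionParams& params, uint32_t round) {
    assert(round < kMaxTreeReductionRounds);
    return params.children_per_round[round] != UINT32_MAX;
}
```

Value-initialization removes the uninitialized-member warning, and the sentinel loop still runs afterwards because a zeroed slot would otherwise be indistinguishable from core 0.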

github-actions · 2026-02-06T02:52:26Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp

+        for (uint32_t r = 0; r < MAX_TREE_REDUCTION_ROUNDS; ++r) {
+            writer_rt_args.push_back(tree_params.children_per_round[r]);


⚠️ modernize-loop-convert ⚠️
use range-based for loop instead

Suggested change

-        for (uint32_t r = 0; r < MAX_TREE_REDUCTION_ROUNDS; ++r) {
-            writer_rt_args.push_back(tree_params.children_per_round[r]);
+        for (unsigned int r : tree_params.children_per_round) {
+            writer_rt_args.push_back(r);

github-actions · 2026-02-06T02:52:26Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp

+        for (uint32_t r = 0; r < MAX_TREE_REDUCTION_ROUNDS; ++r) {
+            compute_rt_args.push_back(tree_params.children_per_round[r]);


⚠️ modernize-loop-convert ⚠️
use range-based for loop instead

Suggested change

-        for (uint32_t r = 0; r < MAX_TREE_REDUCTION_ROUNDS; ++r) {
-            compute_rt_args.push_back(tree_params.children_per_round[r]);
+        for (unsigned int r : tree_params.children_per_round) {
+            compute_rt_args.push_back(r);

github-actions · 2026-02-06T02:52:26Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp

+            // writer runtime args - need to match the size with tree reduction params
+            // Base args (16) + children_per_round (MAX_TREE_REDUCTION_ROUNDS) + group coords (2*num_cores_per_head)
+            // + reducer coords + output coords
+            std::vector<uint32_t> writer_rt_args(16 + MAX_TREE_REDUCTION_ROUNDS + 2 * num_cores_per_head, 0);


⚠️ readability-math-missing-parentheses ⚠️
* has higher precedence than +; add parentheses to explicitly specify the order of operations

Suggested change

-            std::vector<uint32_t> writer_rt_args(16 + MAX_TREE_REDUCTION_ROUNDS + 2 * num_cores_per_head, 0);
+            std::vector<uint32_t> writer_rt_args(16 + MAX_TREE_REDUCTION_ROUNDS + (2 * num_cores_per_head), 0);

alingTT · 2026-02-09T16:40:26Z

/codeowners ping

tenstorrent-github-bot · 2026-02-09T16:49:52Z

CodeOwners Group Analysis

This PR requires approval from one member of each of the following groups:

Summary: 1 pending groups, 0 approved groups

Group Information:

⏳ tenstorrent/metallium-maintainers-llama-models (Team) - Members: Raymond Kim, Colman Glagovich, Evan Smal, Harry Andrews | Pending approval
📁 Files owned by this team (6 files)

Note: At least one approval from each group is sufficient.

alingTT · 2026-02-09T18:36:55Z

/codeowners ping

tenstorrent-github-bot · 2026-02-09T18:37:42Z

🔄 CodeOwners Summary Updated

✅ CodeOwners summary updated here

💡 Tip: Use /codeowners new to post a fresh summary comment instead of updating the existing one.

tenstorrent-github-bot · 2026-02-09T18:37:51Z

Hi Evan Smal (@esmalTT), Raymond Kim (@tt-rkim), this PR SDPA Decode Optimization: Tree Reduce by Ambrose Ling (@alingTT) needs your approval/review to merge this.

cglagovichTT · 2026-02-09T19:33:07Z

ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp

+                uint32_t child_id = actual_children_per_round[round];
+                if (child_id != UINT32_MAX) {
+                    // Writer kernel handles the semaphore wait and data transfer to cb_m_in, cb_l_in, cb_out_o
+                    // Data arrives in order: m, l, o


Why does data arrive in the order m, l, o if l is processed first?

alingTT · Feb 9, 2026

The send order got messed up; data should arrive in the order l, m, o. Just fixed. I will do a few more passes over the code to double check.

alingTT force-pushed the aling/tree-reduce branch from aafa675 to 175c336 Compare February 6, 2026 02:39

alingTT marked this pull request as ready for review February 6, 2026 02:42

alingTT requested review from a team as code owners February 6, 2026 02:42

Copilot AI review requested due to automatic review settings February 6, 2026 02:42

Copilot started reviewing on behalf of alingTT February 6, 2026 02:42

Copilot AI reviewed Feb 6, 2026

github-actions bot reviewed Feb 6, 2026

cglagovichTT approved these changes Feb 9, 2026

alingTT force-pushed the aling/tree-reduce branch from f72cada to dc571f9 Compare February 9, 2026 20:33

alingTT added 4 commits February 12, 2026 22:21

tree reduce on sdpa decode

5b70696

fix hang in prev sum

69cf16d

clean up

0757704

more fixes to tests

0c103ec

alingTT force-pushed the aling/tree-reduce branch from 0fbb0fc to 0c103ec Compare February 12, 2026 22:21

